INTERNATIONAL WORKSHOP MULTILINGUAL RESOURCES, TECHNOLOGIES AND EVALUATION FOR CENTRAL AND EASTERN EUROPEAN LANGUAGES
Abstract
This paper discusses the building of the first Bulgarian–Polish–Lithuanian (for short, BG–PL–LT) experimental corpus. The BG–PL–LT corpus (currently under development, for research purposes only) contains more than 3 million words and comprises two corpora: parallel and comparable. The BG–PL–LT parallel corpus contains more than 1 million words. A small part of the parallel corpus comprises original texts in one of the three languages with translations into the other two, and texts of official documents of the European Union available through the Internet. Texts (fiction) in other languages translated into Bulgarian, Polish, and Lithuanian form the main part of the parallel corpus. The comparable BG–PL–LT corpus includes: (1) texts in Bulgarian, Polish and Lithuanian, mainly fiction, with text sizes comparable across the three languages, and (2) excerpts from E-media newspapers distributed via the Internet and with the same thematic content. Some of the texts have been annotated at paragraph level. This allows texts in all three languages, and in the pairs BG–PL, PL–LT, BG–LT and vice versa, to be aligned at paragraph level in order to produce aligned three- and bilingual corpora. The authors focus on the morphosyntactic annotation of the parallel trilingual corpus according to the Corpus Encoding Standard (CES). The tagsets for corpus annotation are briefly discussed from the point of view of possible unification in the future. Some examples are presented.
Similar resources
MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora
The paper presents the fourth, “Mondilex” edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development, focused on the morphosyntactic level of linguistic description. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes the EAGLES-based morphosyntactic specif...
Extending an Information Extraction tool set to Central and Eastern European languages
In a highly multilingual and multicultural environment such as in the European Commission with soon over twenty official languages, there is an urgent need for text analysis tools that use minimal linguistic knowledge so that they can be adapted to many languages without much human effort. We are presenting two such Information Extraction tools that have already been adapted to various Western ...
An Archive For All Of Europe
TRACTOR is the TELRI Research Archive of Computational Tools and Resources. It features monolingual, bilingual, and multilingual corpora and lexicons in a wide variety of languages, as well as tools for language processing. TRACTOR is a key element of TELRI II, a pan-European alliance of focal national language technology institutions with the emphasis on Central and Eastern European and NIS cou...
Transculturation and Multilingual Lives: Writing between Languages and Cultures
This paper looks at the issues of transculturation as explored in auto- and semi-autobiographical accounts of linguistic and cultural transitions. The paper also addresses a number of questions about the structure of these texts, the authors’ linguistic competences, as well as questions about the theoretical and conceptual tool which may help us to discuss the issues the writers are reflecting o...
MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora
The paper presents the third edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes the EAGLES-based morphosyntactic specifications, defining the features that describe word-level syntactic annotation...